Designing better data plots at the Burnet Institute

Dianne Cook
Monash University

Outline

time topic
3:00 Quantitatively assessing the best plot design, and incorporating uncertainty
3:30 Styling and theming plots, and writing alt-text descriptions
4:00 Polishing your plots

Quantitatively assessing the best plot design

Better design

The same procedure can be used to compare different plot designs.

If the real plot is detected faster and more often amongst a page of decoys, using one design in comparison to another, then that design is better.
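
As a rough sketch of how "detected more often" might be quantified, detection rates under two designs can be compared with a test of proportions. The counts below are hypothetical, invented purely for illustration, and the 1/m guessing rate assumes each observer picks a single plot from a lineup of size m. Time to decision could be compared in a similar way.

```r
# Hypothetical results (illustrative numbers, not from any study)
m <- 12                        # plots per lineup
shown    <- c(A = 30, B = 30)  # observers who saw each design
detected <- c(A = 21, B = 9)   # observers who picked the data plot

# Detection rate for each design
rate <- detected / shown

# Chance of at least this many correct picks if observers were only guessing
p_guess <- pbinom(detected - 1, size = shown, prob = 1 / m, lower.tail = FALSE)

# Do the two designs differ in detection rate?
cmp <- prop.test(x = detected, n = shown)
cmp$p.value
```

A small p-value from the comparison would suggest that the design with the higher detection rate shows the structure more effectively.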

  • There are 12 plots on the page. Your job is to pick the one that is the most different.
  • Keep your answer secret until asked to reveal.
  • We will split into two groups, A and B.
  • You will be in either A or B. If in B, shut your eyes until I tell you to open them.

Process

  1. Create Lineup Data: assuming that at least two variables, X and Y, are involved in the design, we create data for a lineup of size \(m\) by creating \(m-1\) permutations of Y or, in the case of a simulation study, drawing \(m-1\) samples of size \(n\) (the number of rows in the data) from the null distribution. Insert the original data into the lineup at a random position between 1 and \(m\). The R package nullabor provides a framework for easy creation of lineup data.
  2. Create lineups from competing designs: using the same data, render lineups of all competing designs.
  3. Evaluate Lineups: present the lineups to independent observers. Assess both signal strength and the time individuals need to come to a decision. Note that each observer should see each lineup data set only once.
  4. Evaluate Competing Designs: differences in signal strength or time to decision are due to differences in the design. In the case that individuals were shown multiple lineups (as part of a bigger study), it is possible to correct outcome measurements for an individual’s visual ability.
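
Step 1 can be sketched in base R as below, under the assumption of a permutation null and simulated stand-in data (the names `dat`, `make_panel` and `lineup_data` are illustrative). The nullabor functions `lineup()` and `null_permute()` automate the same idea.

```r
set.seed(1)
n <- 50   # rows in the data
m <- 12   # lineup size

# Simulated stand-in for the observed data, with a real X-Y association
dat <- data.frame(x = rnorm(n))
dat$y <- 0.6 * dat$x + rnorm(n)

# Random position for the data plot
pos <- sample(m, 1)

# Panel k holds the original data if k == pos, otherwise a permutation of Y
make_panel <- function(k) {
  panel <- dat
  if (k != pos) panel$y <- sample(panel$y)  # permuting breaks any association
  panel$.sample <- k
  panel
}
lineup_data <- do.call(rbind, lapply(seq_len(m), make_panel))
```

Every competing design is then rendered from the same `lineup_data`, faceted by `.sample`, so that any difference in detection is due to the design alone.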

Foundation

Tidy data and random variables

  • Tidy data mirrors elementary statistics
  • Tabular form puts variables in columns and observations in rows
  • Not all tabular data is in this form
  • In this form, we can think about \(X_1 \sim N(0,1), ~~X_2 \sim \text{Exp}(1) ...\)

\[\begin{align}X &= \left[ \begin{array}{rrrr} X_1 & X_2 & ... & X_p \end{array} \right] \\ &= \left[ \begin{array}{rrrr} X_{11} & X_{12} & ... & X_{1p} \\ X_{21} & X_{22} & ... & X_{2p} \\ \vdots & \vdots & \ddots& \vdots \\ X_{n1} & X_{n2} & ... & X_{np} \end{array} \right]\end{align}\]
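
A minimal sketch of this view in R, assuming the two example distributions above: each column of a tidy data frame is a sample from one random variable, and each row is one observation.

```r
set.seed(42)
n <- 100

# Variables in columns, observations in rows
X <- data.frame(
  X1 = rnorm(n, mean = 0, sd = 1),  # X_1 ~ N(0, 1)
  X2 = rexp(n, rate = 1)            # X_2 ~ Exp(1)
)

dim(X)       # n rows by p = 2 columns
colMeans(X)  # column means estimate E(X_1) = 0 and E(X_2) = 1
```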

Grammar of graphics and statistics

  • A statistic is a function of the values of items in a sample, e.g. for \(n\) iid random variates \(\bar{X}_1=\displaystyle\frac{1}{n}\displaystyle\sum_{i=1}^n X_{i1}\), \(s_1^2=\displaystyle\frac{1}{n-1}\displaystyle\sum_{i=1}^n(X_{i1}-\bar{X}_1)^2\)
  • We study the behaviour of the statistic over all possible samples of size \(n\).
  • The grammar of graphics is the mapping of (random) variables to graphical elements, making plots of data into statistics
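
The behaviour of a statistic over many samples can be approximated by simulation. The sketch below assumes standard normal data and an arbitrary sample size of 25; a plot treated as a statistic can be studied the same way, by drawing it for many samples.

```r
set.seed(2024)
n <- 25            # sample size
n_samples <- 5000  # number of simulated samples

# Compute the sample mean and variance for each simulated sample
stats <- replicate(n_samples, {
  x <- rnorm(n)
  c(mean = mean(x), var = var(x))
})

rowMeans(stats)      # close to the true mean 0 and variance 1
sd(stats["mean", ])  # close to 1 / sqrt(n) = 0.2
```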

Example 1

If examining the distributions of two categorical variables, what would be NOT INTERESTING?


Both being the same is not interesting.

  • Simulate 11 null data sets in which both variables have the same distribution.
  • Plot these data, together with the real data, in the same lineup structure using the different designs.
Code
library(tibble)
library(dplyr)
library(tidyr)

set.seed(706)
pos <- sample(1:12)  # random panel order; pos[1] is where the real data goes
nobs <- 264
# category probabilities: s1 and s2 differ, and s1 alone is used for the nulls
prob <- tibble(s1 = c(1, 4, 5, 2, 3)/15,
               s2 = c(5, 4, 3, 1, 1)/15) 
# real data: v1 drawn from s1, v2 drawn from s2
d <- tibble(v1 = factor(sample(LETTERS[1:5],
                        size = nobs, replace = TRUE, 
                        prob = prob$s1),
                        levels=LETTERS[1:5]),
            v2 = factor(sample(LETTERS[1:5],
                        size = nobs, replace = TRUE, 
                        prob = prob$s2),
                        levels=LETTERS[1:5]))
# long form of the counts, with p the proportion of all observations
d_long <- d |>
  count(v1, v2) |>
  mutate(p = n/nobs) |>
  pivot_longer(cols = c(v1, v2), 
               names_to = "var", 
               values_to = "val") |>
  mutate(.sample = pos[1])
d_lineup <- d_long
dd_lineup <- d |>
  mutate(.sample = pos[1])
# simulate 11 nulls: both v1 and v2 drawn from s1
for (i in 2:12) {
  d <- tibble(v1 = factor(sample(LETTERS[1:5],
                        size = nobs, replace = TRUE, 
                        prob = prob$s1),
                        levels=LETTERS[1:5]),
            v2 = factor(sample(LETTERS[1:5],
                        size = nobs, replace = TRUE, 
                        prob = prob$s1),
                        levels=LETTERS[1:5]))
  # same columns as the real data, so the data panel cannot be told apart
  d_long <- d |>
    count(v1, v2) |>
    mutate(p = n/nobs) |>
    pivot_longer(cols = c(v1, v2), 
                 names_to = "var", 
                 values_to = "val") |>
    mutate(.sample = pos[i])
  d_lineup <- bind_rows(d_lineup, d_long)
  d <- d |>
    mutate(.sample = pos[i])
  dd_lineup <- bind_rows(dd_lineup, d)
}
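
One way the second step, rendering the same lineup data in competing designs, might look is sketched below. It is self-contained, so it rebuilds a simplified lineup in raw long form rather than reusing the objects above. The two designs (side-by-side bars and stacked proportion bars) and the names `panel`, `lineup_long`, `design1` and `design2` are assumptions chosen for illustration, not necessarily the designs compared in the session.

```r
library(ggplot2)

set.seed(706)
m <- 12
nobs <- 264
lv <- LETTERS[1:5]
s1 <- c(1, 4, 5, 2, 3) / 15
s2 <- c(5, 4, 3, 1, 1) / 15
pos <- sample(m, 1)  # panel holding the data plot

# v1 always comes from s1; v2 comes from s2 only in the data panel
panel <- function(k) {
  p2 <- if (k == pos) s2 else s1
  data.frame(
    .sample = k,
    var = rep(c("v1", "v2"), each = nobs),
    val = factor(c(sample(lv, nobs, replace = TRUE, prob = s1),
                   sample(lv, nobs, replace = TRUE, prob = p2)),
                 levels = lv)
  )
}
lineup_long <- do.call(rbind, lapply(seq_len(m), panel))

# Strip labels that are not needed to compare panels
bare <- theme(legend.position = "none",
              axis.text = element_blank(),
              axis.title = element_blank())

# Design 1: side-by-side bars, one pair per category
design1 <- ggplot(lineup_long, aes(x = val, fill = var)) +
  geom_bar(position = "dodge") +
  facet_wrap(~.sample, ncol = 4) + bare

# Design 2: one stacked proportion bar per variable
design2 <- ggplot(lineup_long, aes(x = var, fill = val)) +
  geom_bar(position = "fill") +
  facet_wrap(~.sample, ncol = 4) + bare
```

Showing `design1` to one group of observers and `design2` to another, with the data plot in the same position, gives the detection rates and times to compare.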

Incorporating uncertainty

Why?

Adding representation of the uncertainty in a visualisation should:

  1. Reinforce signals that are important
  2. Hide signals that are primarily noise

to enable better decisions and conclusions, or dare we say, inference.

Common options

Melbourne pedestrian counts at Southern Cross Station, October 2025.

Code
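library(dplyr)
library(tidyr)
library(stringr)
library(lubridate)
library(ggplot2)
library(ggdist)      # stat_pointinterval(), stat_gradientinterval(), geom_lineribbon()
library(colorspace)  # scale_fill_discrete_sequential()

load("data/ped_Oct2025.rda")
ped_sc <- ped |>
  filter(Sensor == "Southern Cross Station") |>
  # wday() == 5 selects Thursdays under lubridate's default (week starts on Sunday)
  filter(wday(Date) == 5) |>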
  group_by(Time) |>
  summarise(ave = mean(Count),
            mx = max(Count),
            mn = min(Count),
            dif = mx - mn, 
           .groups = "drop") 
b1 <- ggplot(ped_sc, aes(x=Time, y=ave)) +
  geom_col(fill = "#20794D") +
  xlab("Hour") + ylab("Count")
b2 <- ggplot(ped_sc, aes(x=Time, y=ave)) +
  geom_col(fill = "#b9ca4a") +
  geom_errorbar(aes(ymin = mn, ymax = mx),
    width=0.5, colour="#20794D") +
  xlab("Hour") + ylab("Count")
b3 <- ggplot(ped_sc, aes(x=Time,
    ydist=distributional::dist_normal(ave, dif))) +
  stat_pointinterval(colour = "#20794D") +
  xlab("Hour") + ylab("Count")
b4 <- ggplot(ped_sc, aes(x=Time,
    ydist=distributional::dist_normal(ave, dif/2))) +
  stat_gradientinterval(colour = NA, fill="#20794D", 
    .width=1) +
  geom_line(aes(x=Time, y=ave), colour="#20794D") +
  xlab("Hour") + ylab("Count")
b5 <- ggplot(ped_sc, aes(x=Time, y=ave)) +
  geom_ribbon(aes(ymin = mn, 
                  ymax = mx),
    fill = "#b9ca4a") +
  geom_line(colour="#20794D") +
  xlab("Hour")
ped_sc_ci <- ped_sc |>
  mutate(l50 = ave - 0.50*dif,
         u50 = ave + 0.50*dif,
         l80 = ave - 0.8*dif,
         u80 = ave + 0.8*dif,
         l99 = ave - 0.99*dif,
         u99 = ave + 0.99*dif
  ) |>
  pivot_longer(cols=l50:u99, names_to = "intprob", 
    values_to="value") |>
  mutate(bound = str_sub(intprob, 1, 1),
       prob = str_sub(intprob, 2, 3)) |>
  select(Time, ave, prob, bound, value) |>
  pivot_wider(names_from = bound, values_from = value) 
b6 <- ggplot(ped_sc_ci, aes(x=Time, y=ave)) +
  geom_lineribbon(aes(ymin = l, ymax = u, fill = prob)) +
  labs(x="Hour", fill="Confidence") +
  scale_fill_discrete_sequential(palette = "Greens", 
    rev=FALSE, nmax=5) +
  theme(legend.position = "none")
b7 <- ggplot(ped_sc, aes(x=Time, y=ave)) +
  geom_smooth(colour = "#20794D", fill = "#b9ca4a", span=0.4) +
  geom_point(colour = "#20794D") +
  xlab("Hour")

Broader applicability

The approaches used on the bar charts here apply equally to many other types of display.

  • Error bars
  • Error bands
  • Gradients
  • Multiple samples, such as bootstrap or simulation


Defining uncertainty, or what metric to calculate, is often not straightforward.
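
To illustrate how much the choice of metric matters, the sketch below computes three candidate intervals for the same set of counts: the observed range, a normal-theory interval for the mean, and a bootstrap percentile interval for the mean. The counts are simulated stand-ins (Poisson with an arbitrary rate of 400), not the pedestrian data.

```r
set.seed(99)
counts <- rpois(8, lambda = 400)  # simulated stand-in for one hour over 8 days

# Option 1: observed range (min to max), as drawn by the error bars
rng <- range(counts)

# Option 2: mean plus or minus two standard errors
se_int <- mean(counts) + c(-2, 2) * sd(counts) / sqrt(length(counts))

# Option 3: bootstrap percentile interval for the mean
boot_means <- replicate(2000, mean(sample(counts, replace = TRUE)))
boot_int <- quantile(boot_means, c(0.025, 0.975))

rbind(range = rng, normal = se_int, bootstrap = unname(boot_int))
```

The range describes variation between days, while the other two describe uncertainty in the average, so the displayed band can differ substantially depending on which is mapped.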

Styling and theming

Styling

The BBC cookbook has good basic advice for journalism and reports. The work of Amanda Cox has been instrumental in the NY Times data visualisations.

The Royal Statistical Society provides a Best Practices for Data Visualisation guide, with its own RSSthemes package for ggplot2, aimed at academic publications.



The default ggplot2 theme theme_grey() is designed to give the data plot the same ink strength on the page as the surrounding text.

Australia has maintained its status as a low-incidence tuberculosis (TB) country over the past decade, with notification rates that have remained relatively stable despite global fluctuations in TB burden. The country’s TB epidemiology presents a unique profile characterized by consistent low domestic transmission and a disease burden heavily concentrated among overseas-born populations.

Australia reports approximately 1,300 cases of TB per year and has a TB case notification rate of 5.5 cases per 100,000 population, though recent data suggest this rate has shown slight variations. Incidence of tuberculosis (per 100,000 people) in Australia was reported at 6.2 in 2023, indicating a modest increase from historical averages. In 2015 this was 5.3 per 100,000 population per year, corresponding to 1,244 individual notifications, demonstrating the relatively stable nature of TB incidence in the country. The consistency of these figures over the decade reflects Australia’s effective TB control measures and robust public health surveillance systems. This rate has remained essentially unchanged since the mid-1980s; however, a slight increase in rates has been observed since 2003, suggesting a gradual but measurable trend that health authorities continue to monitor closely.

Fig 1. TB incidence in Australia, 1980-2021. Incidence initially dropped, but it has been climbing steadily over the last two decades. Note that counts are not population adjusted.

A defining characteristic of Australia’s TB epidemiology is the overwhelming concentration of cases among overseas-born populations. Between 88% and 95% of TB cases in Australia have been reported in the overseas-born population, highlighting the critical role of migration patterns in shaping the country’s TB landscape. This demographic distribution has remained consistent throughout the past decade and represents one of the most significant epidemiological features of TB in Australia. The Australian-born population experiences markedly lower TB rates, with specific risk factors identified in vulnerable groups. Research from Victoria reveals that the most common risk factor in the 0–14 year age group was a household contact with tuberculosis (85.1%), followed by having a parent from a high tuberculosis incidence country (70.2%). These findings underscore the importance of contact tracing and screening programs, particularly for children in households with overseas-born parents from high-burden countries.

Overall themes

The ggthemes package supplements the handful of themes available in ggplot2.

These are convenient definitions of the array of style choices for a data plot that include background, position and sizing of title and axis text, legend position and arrangement, axes, ticks and grid lines, extra space at plot edges, …

Code
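library(ggplot2)
library(ggthemes)   # provides theme_tufte() and theme_economist()
library(patchwork)  # combines plots with + and plot_layout()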
p1 <- tb_aus_p + theme_grey() +
  theme(aspect.ratio = 0.5) + ggtitle("default")
p2 <- tb_aus_p + theme_minimal() +
  theme(aspect.ratio = 0.5) + ggtitle("minimal")
p3 <- tb_aus_p + theme_tufte() +
  theme(aspect.ratio = 0.5) + ggtitle("tufte")
p4 <- tb_aus_p + theme_economist() +
  theme(aspect.ratio = 0.5) + ggtitle("economist")
p1 + p2 + p3 + p4 + plot_layout(ncol=2)

Setting theme elements

A theme for all plots was specified for these two slide decks.

theme_set(ggthemes::theme_gdocs(base_size = 14) +
  theme(plot.background = 
        element_rect(fill = 'transparent', colour = NA),
        axis.line.x = element_line(color = "black", 
                                   linewidth = 0.4),
        axis.line.y = element_line(color = "black", 
                                   linewidth = 0.4),
        panel.grid.major = element_line(color = "grey90"),
        axis.ticks = element_line(color = "black"),
        plot.title.position = "plot",
        plot.title = element_text(size = 14),
        panel.background  = 
          element_rect(fill = 'transparent', colour = "black"),
        legend.background = 
          element_rect(fill = 'transparent', colour = NA),
        legend.key        = 
          element_rect(fill = 'transparent', colour = NA)
  ) 
)

Basic theme: theme_gdocs()

  • Text size adjusted
  • Background set
  • Axes lines adjusted
  • Grid lines colour changed
  • Title position changed

Colour palettes

The colorspace package has a comprehensive set of palettes, along with tools for assessing palettes and creating new ones.

If you have mapped the variables correctly, changing the colours only requires assigning a new palette.

Code
tb_age <- tb_tidy |> 
  filter(!(age %in% c("0-14", "unknown"))) |>
  ggplot(aes(x = year, 
             y = count, 
             colour = age)) +
  geom_point() +
  geom_smooth(se = FALSE) +
  facet_wrap(~sex, ncol = 2) +
  scale_x_continuous("year", 
    breaks = seq(1998, 2012, 2), 
    labels = c("98", "00", "02", "04", "06", "08", "10", "12")) +
  theme(axis.text = element_text(size=10)) 
tb_age  

Code
tb_age +
  scale_color_discrete_divergingx(palette="Zissou 1") 

Code
tb_age +
  scale_color_discrete_divergingx(palette="Geyser") 

Code
tb_age +
  scale_color_discrete_sequential(palette="OrYel") 

Colorblind proofing

The colorspace package has the functions deutan(), protan() and tritan(), which simulate colour vision deficiencies.

You need to create the changed colour palette outside the plotting code, and then apply it manually.

The dichromat package also has tools to simulate colour vision deficiencies.

Code
clrs <- deutan(scales::hue_pal()(6))
tb_age + scale_colour_manual("", values = clrs)

Code
clrs <- deutan(divergingx_hcl(6, "Zissou 1"))
tb_age +
  scale_colour_manual("", values = clrs)

Code
clrs <- deutan(divergingx_hcl(6, "Geyser"))
tb_age +
  scale_colour_manual("", values = clrs)

Code
clrs <- deutan(sequential_hcl(6, "OrYel"))
tb_age +
  scale_colour_manual("", values = clrs)
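
The examples here use deutan() only. As a sketch, the same palette can be passed through all three simulations and the results compared side by side with swatchplot(), also from colorspace; the choice of "Zissou 1" with six colours simply mirrors the code above.

```r
library(colorspace)

pal <- divergingx_hcl(6, palette = "Zissou 1")

# Simulate each type of colour vision deficiency on the same palette
sims <- list(
  original = pal,
  deutan   = deutan(pal),
  protan   = protan(pal),
  tritan   = tritan(pal)
)

# One row of swatches per simulation, for a visual check
swatchplot(sims)
```

If two colours in a row become hard to tell apart, the palette is a poor choice for that group of readers.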

Creating ensemble of plots

One plot is often not enough for a report or an academic publication.

Creating and arranging an ensemble of plots is an art.

The packages patchwork and cowplot can help with layout.

For the tuberculosis data, suppose we want one overview plot, and then separate displays showing sex and age effects.

Code
ov <- tb_tidy |>
  group_by(year) |>
  summarise(count = sum(count)) |>
  ggplot(aes(x=year, y=count)) +
    geom_col() +
    annotate("text", x=1996, y=320, label="A", size=8) +
    xlim(c(1996, 2013))
sex <- tb_tidy |>
  group_by(year, sex) |>
  summarise(count = sum(count)) |>
  ggplot(aes(x=year, weight=count, fill=sex)) +
    geom_bar(position="fill") +
    scale_fill_discrete_divergingx(palette = "TealRose",
      rev=TRUE) +
    ylab("proportion") +
    annotate("text", x=1996, y=0.95, label="B", size=8) +
    xlim(c(1996, 2013))
age <- tb_tidy |>
  group_by(year, age) |>
  summarise(count = sum(count)) |>
  ggplot(aes(x=year, weight=count, fill=age)) +
    geom_bar(position="fill") +
    scale_fill_discrete_sequential(palette = "Sunset") +
    ylab("proportion") +
    annotate("text", x=1996, y=0.95, label="C", size=8) +
    xlim(c(1996, 2013))

ov + sex/age + plot_layout(widths=c(2,1))

Adding alt text

Alt text, or alternative text, is a descriptive text alternative for images in a document that serves two main purposes:

  • accessibility: it allows screen readers to describe images to visually impaired users
  • search engine optimization (SEO): it helps search engines understand the image’s content to improve search rankings.

How to write alt text for data visualisations is not well-documented.
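
One place to start, as a minimal sketch: ggplot2 lets alt text be attached to a plot through the `alt` argument of `labs()` and read back with `get_alt_text()`. The example uses the built-in `mtcars` data, and the wording of the description is only a suggestion, stating the plot type, the variables and the main pattern.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    alt = paste(
      "Scatterplot of fuel efficiency in miles per gallon against",
      "weight for the 32 cars in the mtcars data.",
      "Fuel efficiency falls as weight increases."
    )
  )

# Retrieve the alt text stored with the plot
get_alt_text(p)
```

In a Quarto document the same description can instead be supplied through the `fig-alt` chunk option.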

Polishing your plots

Resources

End of session 2

Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.